{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "muiqKarukWj0"
      },
      "outputs": [],
      "source": [
        "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
        "\n",
        "# Licensed to the Apache Software Foundation (ASF) under one\n",
        "# or more contributor license agreements. See the NOTICE file\n",
        "# distributed with this work for additional information\n",
        "# regarding copyright ownership. The ASF licenses this file\n",
        "# to you under the Apache License, Version 2.0 (the\n",
        "# \"License\"); you may not use this file except in compliance\n",
        "# with the License. You may obtain a copy of the License at\n",
        "#\n",
        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing,\n",
        "# software distributed under the License is distributed on an\n",
        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
        "# KIND, either express or implied. See the License for the\n",
        "# specific language governing permissions and limitations\n",
        "# under the License"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZUSiAR62SgO8"
      },
      "source": [
        "# Generate text embeddings by using the Vertex AI API\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "</table>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bkpSCGCWlqAf"
      },
      "source": [
        "Text embeddings are a way to represent text as numerical vectors. This process lets computers understand and process text data, which is essential for many natural language processing (NLP) tasks.\n",
        "\n",
        "The following NLP tasks use embeddings:\n",
        "\n",
        "* **Semantic search:** Find documents or passages that are relevant to a query when the query doesn't use the exact same words as the documents.\n",
        "* **Text classification:** Categorize text data into different classes, such as spam and not spam, or positive sentiment and negative sentiment.\n",
        "* **Machine translation:** Translate text from one language to another and preserve the meaning.\n",
        "* **Text summarization:** Create shorter summaries of text.\n",
        "\n",
        "This notebook uses the Vertex AI text-embeddings API to generate text embeddings that use Google’s large generative artificial intelligence (AI) models. To generate text embeddings by using the Vertex AI text-embeddings API, use `MLTransform` with the `VertexAITextEmbeddings` class to specify the model configuration. For more information, see [Get text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) in the Vertex AI documentation. \n",
        "\n",
        "For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.\n",
        "\n",
        "## Requirements\n",
        "\n",
        "To use the Vertex AI text-embeddings API, complete the following prerequisites:\n",
        "\n",
        "* Install the `google-cloud-aiplatform` Python package.\n",
        "* Do one of the following tasks:\n",
        "  * Configure credentials for your Google Cloud project. For more information, see [Google Auth Library for Python](https://googleapis.dev/python/google-auth/latest/reference/google.auth.html#module-google.auth).\n",
        "  * Store the path to a service account JSON file by using the [GOOGLE_APPLICATION_CREDENTIALS](https://cloud.google.com/docs/authentication/application-default-credentials#GAC) environment variable."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W29FgO5Qv2ew"
      },
      "source": [
        "To use your Google Cloud account, authenticate this notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nYyyGYt3licq"
      },
      "outputs": [],
      "source": [
        "from google.colab import auth\n",
        "auth.authenticate_user()\n",
        "\n",
        "# Replace <PROJECT_ID> with a valid Google Cloud project ID.\n",
        "project = '<PROJECT_ID>' # @param {type:'string'}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UQROd16ZDN5y"
      },
      "source": [
        "## Install dependencies\n",
        " Install Apache Beam and the dependencies required for the Vertex AI text-embeddings API."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BTxob7d5DLBM"
      },
      "outputs": [],
      "source": [
        "! pip install apache_beam[interactive,gcp]>=2.53.0 --quiet"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SkMhR7H6n1P0"
      },
      "outputs": [],
      "source": [
        "import tempfile\n",
        "import apache_beam as beam\n",
        "from apache_beam.ml.transforms.base import MLTransform\n",
        "from apache_beam.ml.transforms.embeddings.vertex_ai import VertexAITextEmbeddings"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cokOaX2kzyke"
      },
      "source": [
        "## Transform the data\n",
        "\n",
        "`MLTransform` is a `PTransform` that you can use for data preparation, including generating text embeddings.\n",
        "\n",
        "### Use MLTransform in write mode\n",
        "\n",
        "In `write` mode, `MLTransform` saves the transforms and their attributes to an artifact location. Then, when you run `MLTransform` in `read` mode, these transforms are used. This process ensures that you're applying the same preprocessing steps when you train your model and when you serve the model in production or test its accuracy."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-x7fVvuy-aDs"
      },
      "source": [
        "### Get the data\n",
        "\n",
        "`MLTransform` processes dictionaries that include column names and their associated text data. To generate embeddings for specific columns, specify these column names in the `columns` argument of `VertexAITextEmbeddings`. This transform uses the the Vertex AI text-embeddings API for online predictions to generate an embeddings vector for each sentence."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "be-vR159pylF"
      },
      "outputs": [],
      "source": [
        "artifact_location = tempfile.mkdtemp(prefix='vertex_ai')\n",
        "\n",
        "# Use the latest text embedding model from the Vertex AI text-embeddings API documentation.\n",
        "# https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings\n",
        "text_embedding_model_name = 'textembedding-gecko@latest'\n",
        "\n",
        "# Generate text embeddings on the sentences.\n",
        "content = [\n",
        "    {\n",
        "        'x' : 'I would like embeddings for this text'\n",
        "    },\n",
        "    {\n",
        "        'x' : 'Hello world'\n",
        "    },\n",
        "    {\n",
        "        'x': 'The Dog is running in the park.'\n",
        "    }\n",
        "  ]\n",
        "\n",
        "# helper function that returns a dict containing only first\n",
        "# ten elements of generated embeddings\n",
        "def truncate_embeddings(d):\n",
        "  for key in d.keys():\n",
        "    d[key] = d[key][:10]\n",
        "  return d"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "UQGm1be3p7lM",
        "outputId": "b41172ca-1c73-4952-ca87-bfe45ca88a6c"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'x': [0.041293490678071976, -0.010302993468940258, -0.048611514270305634, -0.01360565796494484, 0.06441926211118698, 0.022573700174689293, 0.016446372494101524, -0.033894773572683334, 0.004581860266625881, 0.060710687190294266]}\n",
            "Embedding shape: 10\n",
            "{'x': [0.05889148637652397, -0.0046180677600204945, -0.06738516688346863, -0.012708292342722416, 0.06461101770401001, 0.025648491457104683, 0.023468563333153725, -0.039828114211559296, -0.009968819096684456, 0.050098177045583725]}\n",
            "Embedding shape: 10\n",
            "{'x': [0.04683901369571686, -0.013076924718916416, -0.082594133913517, -0.01227626483887434, 0.00417641457170248, -0.024504298344254494, 0.04282262548804283, -0.0009824123699218035, -0.02860993705689907, 0.01609829254448414]}\n",
            "Embedding shape: 10\n"
          ]
        }
      ],
      "source": [
        "embedding_transform = VertexAITextEmbeddings(\n",
        "        model_name=text_embedding_model_name, columns=['x'], project=project)\n",
        "\n",
        "with beam.Pipeline() as pipeline:\n",
        "  data_pcoll = (\n",
        "      pipeline\n",
        "      | \"CreateData\" >> beam.Create(content))\n",
        "  transformed_pcoll = (\n",
        "      data_pcoll\n",
        "      | \"MLTransform\" >> MLTransform(write_artifact_location=artifact_location).with_transform(embedding_transform))\n",
        "\n",
        "  # Show only the first ten elements of the embeddings to prevent clutter in the output.\n",
        "  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)\n",
        "\n",
        "  transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: print(f\"Embedding shape: {len(x['x'])}\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JLkmQkiLx_6h"
      },
      "source": [
        "### Use MLTransform in read mode\n",
        "\n",
        "In `read` mode, `MLTransform` uses the artifacts saved during `write` mode. In this example, the transform and its attributes are loaded from the saved artifacts. You don't need to specify artifacts again during `read` mode.\n",
        "\n",
        "In this way, `MLTransform` provides consistent preprocessing steps for training and inference workloads."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "r8Y5vgfLx_Xu",
        "outputId": "e7cbf6b7-5c31-4efa-90cf-7a8a108ecc77"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'x': [0.04782044142484665, -0.010078949853777885, -0.05793016776442528, -0.026060665026307106, 0.05756739526987076, 0.02292264811694622, 0.014818413183093071, -0.03718176111578941, -0.005486017093062401, 0.04709304869174957]}\n",
            "{'x': [0.042911216616630554, -0.007554919924587011, -0.08996245265007019, -0.02607591263949871, 0.0008614308317191899, -0.023671219125390053, 0.03999944031238556, -0.02983051724731922, -0.015057179145514965, 0.022963201627135277]}\n"
          ]
        }
      ],
      "source": [
        "test_content = [\n",
        "    {\n",
        "        'x': 'This is a test sentence'\n",
        "    },\n",
        "    {\n",
        "        'x': 'The park is full of dogs'\n",
        "    },\n",
        "]\n",
        "\n",
        "with beam.Pipeline() as pipeline:\n",
        "  data_pcoll = (\n",
        "      pipeline\n",
        "      | \"CreateData\" >> beam.Create(test_content))\n",
        "  transformed_pcoll = (\n",
        "      data_pcoll\n",
        "      | \"MLTransform\" >> MLTransform(read_artifact_location=artifact_location))\n",
        "\n",
        "  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
